The data set, from 2008, consists of white wines known as Vinho Verde (green wine). Vinho Verde (VV) refers to the Minho region of northern Portugal; it is not a type of grape. This region is known for being cooler and wetter than the rest of the country in the winter, although it does get hot in the summer.
VV wines have traditionally been known for being light, crisp and low in alcohol, and were meant to be drunk young (when they’re “green”). Lately, however, VV’s reputation as a cheap & cheerful wine has begun to change.
Within this region, there are nine sub-regions.
As you can see from the map, the topography of the VV region is varied, as is that of several of its subregions. Vinho verde wines can come from the Atlantic coast, the mountains, or inland plains. Soils vary, as do the types of grapes grown.
How large is this data set?
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The data is either integer or numeric.
Alas, none of the categorical information mentioned above, such as soil type, grape variety, or subregion, is included in this data set.
Since we need at least one categorical variable for this dataset, let’s convert quality.
## [1] "3" "4" "5" "6" "7" "8" "9"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality
## Min. : 8.00 3: 20
## 1st Qu.: 9.50 4: 163
## Median :10.40 5:1457
## Mean :10.51 6:2198
## 3rd Qu.:11.40 7: 880
## Max. :14.20 8: 175
## 9: 5
There are no NA values. Only citric.acid has zero values. The scale of values varies from the hundreds for total.sulfur.dioxide down to the thousandths for chlorides.
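Checks like these take only a few base R calls. The sketch below uses a small made-up data frame (`toy`, not the wine data) so that it runs on its own; with the real data one would presumably pass `whites` instead.

```r
# Toy stand-in for the wine data frame (values are illustrative only)
toy <- data.frame(
  citric.acid          = c(0.00, 0.27, 0.32, 0.39),
  chlorides            = c(0.009, 0.036, 0.043, 0.050),
  total.sulfur.dioxide = c(9, 108, 134, 167)
)

# Count missing values and exact zeros per column
na_counts   <- colSums(is.na(toy))
zero_counts <- colSums(toy == 0)

# Names of the columns that contain at least one zero
names(zero_counts)[zero_counts > 0]
```

In this toy frame only citric.acid contains a zero, which mirrors the observation above.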
The first thing I wanted to look at was the quality.
quality
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Most wines are average: 74.6% of wines earned a score of 5 or 6.
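The 74.6% figure can be checked directly from the counts in the table. This is a minimal sketch that retypes the counts by hand rather than reading them from the data frame; the variable names are my own.

```r
# Counts per quality score, copied from the table above
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

# Share of each score, in percent
quality_pct <- 100 * quality_counts / sum(quality_counts)

# Combined share of scores 5 and 6
avg_share <- sum(quality_pct[c("5", "6")])
round(avg_share, 1)
```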
By contrast, only:
alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol content ranges from 8 to 14.2% by volume.
The distribution looks unusual, not just bimodal but multimodal. The main mode is at 9.5%, but there are smaller, ‘local’ modes at 10.5% and 12% as well.
Dip at 11.5%
Regardless of binwidth, there was a clear dip at 11.5%. Why? Here is the likely explanation (it also explains the modes, max and min values):
The alcohol level of ‘generic’ Vinho Verde must lie between 8% and 11.5% ABV. However, if the wine is labelled with one of the nine sub-regions, which specialise in particular grape varieties, the range extends from 9% to 14% ABV. Additionally, Vinho Verde made from the single varietal Alvarinho can be between 11.5% and 14% ABV. -https://www.alcoholprofessor.com/blog/2014/04/23/vinho-verde-a-splash-of-summer-vinous-joy/
Are alcohol values continuous or discrete?
# One bar per distinct alcohol value, to see whether values cluster on a grid
ggplot(whites, aes(x = alcohol)) +
  geom_bar(color = I('green')) +
  scale_x_continuous(breaks = seq(8, 14.5, 0.5))
Alcohol content is usually listed to the nearest tenth of a percentage point. Whole numbers and halves (.5) are more common than other values.
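One way to test the "tenth of a percentage point" observation numerically is to check whether each value sits on a 0.1 grid. In the sketch below the first ten values are the alcohol readings shown in the str() output; the last one (10.53) is a made-up off-grid value added for contrast.

```r
# First ten alcohol values from the str() output, plus one hypothetical
# off-grid value (10.53) for contrast
abv <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11, 10.53)

# TRUE where a value is a multiple of 0.1 (allowing for floating-point error)
on_tenth <- abs(abv * 10 - round(abv * 10)) < 1e-8

# Frequency of each first decimal digit among the on-grid values
first_decimal <- round((abv[on_tenth] %% 1) * 10)
table(first_decimal)
```

Applied to the full column, the same table would show whether .0 and .5 really dominate.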
fixed.acidity
In wine tasting, the term “acidity” refers to the fresh, tart and sour attributes of the wine which are evaluated in relation to how well the acidity balances out the sweetness and bitter components of the wine such as tannins. Three primary acids are found in wine grapes: tartaric, malic and citric acids. — http://winemaking.jackkeller.net/acid.asp
fixed.acidity in our data set only refers to tartaric acid, the most predominant acid in wine. It helps stabilize a wine’s chemical make-up and its colour. It also contributes to taste.
It is measured in g/dm^3, which is equivalent to g/l. Multiply this value by 0.1 to express it as a percentage (grams per 100 ml, i.e. weight by volume rather than volume by volume).
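As a quick worked example of that conversion, here is a small sketch applied to the minimum, median and maximum of fixed.acidity from the summary output earlier. The helper name `g_per_l_to_pct` is hypothetical.

```r
# Convert a concentration in g/dm^3 (= g/l) to grams per 100 ml,
# i.e. a weight/volume percentage. The helper name is hypothetical.
g_per_l_to_pct <- function(x) x * 0.1

# Min, median and max of fixed.acidity, from the summary output
fa <- c(min = 3.8, median = 6.8, max = 14.2)
g_per_l_to_pct(fa)
```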
How to interpret fixed acidity values:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
fixed.acidity has a very slight positive skew (the mean sits just above the median), but it is hard to see clearly on this plot. There seem to be some outliers on the low side too.
Let’s try a boxplot instead that also shows the underlying points.
Indeed, there are both negative and positive outliers, but more positive ones and they also extend further. Call this a near-normal distribution.
Do these wines have a higher fixed acidity, in general?
Let’s call this a normal distribution and calculate +/- 1 standard deviation to see where most fixed acidity values fall.
# Median +/- 1 standard deviation of fixed acidity (still in g/dm^3)
fa.sd <- sd(whites$fixed.acidity)
top <- median(whites$fixed.acidity) + fa.sd
bottom <- median(whites$fixed.acidity) - fa.sd
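If the distribution were exactly normal, ~68% of fixed.acidity values would fall between 0.5956132% and 0.7643868% (the median ± 1 standard deviation, converted from g/dm^3). In other words, the upper value of the range is slightly higher than normal for white wines.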
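The ~68% figure is the textbook value for a normal distribution, not something measured from the data. The sketch below reproduces the two bounds from the numbers quoted above and then checks the coverage on synthetic normal data. The standard deviation is backed out from the reported bounds, and `share_within` is a hypothetical helper; on the real data one would pass `whites$fixed.acidity` instead of `sim`.

```r
# Median from the summary output; the SD is not printed in the report, so it
# is backed out from the two reported bounds (an assumption of this sketch)
fa_median <- 6.8
fa_sd     <- (0.7643868 - 0.5956132) / 2 / 0.1   # in g/dm^3

# Bounds in percent (g per 100 ml)
bounds_pct <- (fa_median + c(-1, 1) * fa_sd) * 0.1

# Hypothetical helper: share of values inside [lo, hi]
share_within <- function(x, lo, hi) mean(x >= lo & x <= hi)

# Synthetic, exactly-normal stand-in for the real column
set.seed(1)
sim <- rnorm(10000, mean = fa_median, sd = fa_sd)
share_within(sim, fa_median - fa_sd, fa_median + fa_sd)
```

If the share computed on the actual column differs noticeably from 0.68, that is a sign the near-normal assumption is too generous.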
volatile.acidity
Volatile acidity (VA) is primarily a measure of the presence of acetic acid. While a small amount is a natural by-product of fermentation, exposure to oxygen converts alcohol to acetic acid, which is known as oxidization. Too much acetic acid creates a vinegar taste in wine.
A VA of 0.03-0.06% is produced during fermentation and is considered a normal level. (source: http://www.wineperspective.com/the_acidity_of_wine.htm)
Since volatile.acidity is in g/l, multiply the values by 0.1 to calculate the %.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The distribution is positively skewed (the long tail is on the right). Let’s transform the x-axis to have a better look at the long-tail values.
While most values are less than 0.03%, there are quite a few outliers above 0.06%, with the highest reaching 0.11%.
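To make the comparison with the 0.03-0.06% band concrete, the sketch below converts the summary statistics of volatile.acidity to percent and labels each one. It works on the six summary numbers only, not on the full column, and the band labels are my own.

```r
# Summary statistics of volatile.acidity (g/l), from the output above
va <- c(min = 0.08, q1 = 0.21, median = 0.26, mean = 0.2782,
        q3 = 0.32, max = 1.10)

# Convert to percent (g per 100 ml)
va_pct <- va * 0.1

# Label each statistic against the 0.03-0.06% band quoted above
va_band <- cut(va_pct, breaks = c(-Inf, 0.03, 0.06, Inf),
               labels = c("below", "normal", "above"), right = FALSE)
data.frame(stat = names(va), pct = va_pct, band = va_band)
```

The median lands below the band and the third quartile just inside it, which fits the statement that most values are under 0.03%.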
citric.acid
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The distribution for citric acid is roughly normal, with fewer high outliers than the acids examined above.
After reducing the binwidth I noticed some strange spikes at 0, 0.5 and 0.7g/l, and even at 1g/l.
Here is a possible explanation:
In the European Union, use of citric acid for acidification is prohibited, but limited use of citric acid is permitted for removing excess iron and copper from the wine if potassium ferrocyanide is not available. — https://en.wikipedia.org/wiki/Acids_in_wine#Citric_acid
And this is from a 2003 export agreement between Canada and the EU:
- addition of citric acid for wine stabilisation purposes, provided that the final content in the treated wine does not exceed 1 g/l, — http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52003PC0377 (section B, item 15)
pH
This is a test of how strong the acidity is. Wines typically have a pH between 2.9 and 3.9. The lower the pH, the more acidic (instead of basic) the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
In this dataset pH’s distribution is a near-perfect bell curve.
The pH values range from 2.72 to 3.82, so this dataset veers slightly more than usual toward the acidic end of the wine pH spectrum.
chlorides
In most wines, the chloride concentration is below 50 mg/l, expressed in sodium chloride. It may exceed 1 g/l in wine made from grapes grown by the sea.
Sodium chloride is sometimes added during fining, especially when egg whites are used. - Handbook of Enology, The Chemistry of Wine: Stabilization and Treatments
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Positively skewed, with a particularly long right tail.
Not much more info here. Will try a boxplot.
Transforming the x-axis to log10 shows that chloride values look discrete at the low end and become continuous as chloride levels approach 0.3 g/l (300 mg/l) and a bit beyond. This is higher than normal, but still a far cry from 1 g/l.
I suspect the region’s notorious rain and mist in wintertime result in higher than usual salinity in the soil of coastal subregions, which is then absorbed by the grapes.
Density is what makes wine feel full-bodied. Since VV wines are known for their lightness, I’d expect this dataset to be lower-density than your typical white wine dataset (if I had any to compare it to).
Higher density in wine is usually a result of higher sugar content; higher alcohol content tends to lower it, because ethanol is less dense than water.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Not surprisingly, since it’s related to sugar & alcohol, density’s distribution is a bit lumpy, but less so than those of residual.sugar and alcohol. It also has two extreme positive outliers, just like residual.sugar.
This doesn’t help much.
residual.sugar
Residual sugar, or lack thereof, in wines can be a sign of a flaw - secondary fermentation. For VV wines, this is considered a feature rather than a flaw.
Outside of Champagne, secondary fermentation in the bottle is a serious problem for winemakers, and one that calls for careful precautions. Besides generating an unpleasant effervescence (bubbly isn’t always better, hate to say), the secondary fermentation cuts into the residual sugars and unbalances the wine. But it’s even worse when the dormant yeast wakes up and starts eating up the acids in the wine.
This is called malolactic fermentation, and if it sounds familiar it’s because it is what gives new world Chardonnay that creamy, buttered-toast flavor. Unwanted malo is usually a serious concern, especially in white wines that rely on acidity for balance and texture, but the winemakers in Minho found that the ensuing slight fizziness caused by this flaw actually made the wine more palatable.-badass-sommelier-lets-drink-some-vinho-verde
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
This distribution looks like a swan, where the dramatic mode around 2 is the neck and then a lumpy tail as its body.
Let’s look at this using a log10 x-axis.
There are a few extreme outliers above 20. The distribution now looks like a camel: it has two humps. Are there two populations within this data set?
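To see why the log10 axis helps, it is enough to look at what the transform does to the summary statistics of residual.sugar. This is a small arithmetic sketch on the five summary values only, with variable names of my own choosing.

```r
# Min, Q1, median, Q3 and max of residual.sugar (g/l), from the summary above
rs <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# On the raw scale the maximum is roughly 110 times the minimum
raw_ratio <- unname(rs["max"] / rs["min"])

# On a log10 scale the same values span only about two units
rs_log   <- log10(rs)
log_span <- unname(diff(range(rs_log)))

round(rs_log, 2)
```

Because the extreme values no longer dominate the axis, the bulk of the data is spread out enough for the two humps to become visible.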
If we subset the dataset by residual.sugar value, will it also change the other odd-looking distributions, those for alcohol and density?
whites.dry <- whites %>%
  filter(residual.sugar <= 3)
nrow(whites.dry)
## [1] 1885
qplot(residual.sugar, data=whites.dry, binwidth=0.2, fill=I('pink'))
How did this change the distribution for quality, alcohol, and density?
For drier VV wines, the distributions for alcohol and density look considerably closer to a normal distribution. Even quality’s distribution looks more symmetric.
Now for the sweeter wines:
whites.sweet <- whites %>%
  filter(residual.sugar > 3)
nrow(whites.sweet)
## [1] 3013
qplot(residual.sugar, data=whites.sweet, binwidth=1, fill=I('pink'))
qplot(residual.sugar, data=whites.sweet, binwidth=1,
      fill=I('pink'), color=I('white')) +
  scale_x_continuous(limits=c(0,25))
As seen in whites.dry, the distribution for quality looks even more symmetric than for the full dataset. The shapes of the distributions for alcohol and density, on the other hand, look virtually identical to those of the full dataset.
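The dry/sweet split can also be expressed as a grouping variable instead of two separate data frames, which makes side-by-side summaries easier. The sketch below uses the ten residual.sugar values visible in the str() output as a stand-in for the full column; the `group` variable is my own naming.

```r
# First ten residual.sugar values (g/l) from the str() output
rs <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Same 3 g/l cut-off used for whites.dry and whites.sweet
group <- ifelse(rs <= 3, "dry", "sweet")

# Group sizes and per-group medians
group_sizes   <- table(group)
group_medians <- tapply(rs, group, median)
group_sizes
group_medians

# Sanity check on the row counts reported above: the two subsets
# should add up to the full data set
1885 + 3013 == 4898
```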
There are 4898 rows, and 13 variables. All variables are numeric. In order to have at least one categorical variable, I converted quality to a factor with 7 levels.
Most wine is of average quality, a 5 or a 6. The mean quality score is 5.88.
alcohol ranges from 8 to 14.2%. The most common alcohol content is 9.5%, but the median is 10.4%, very close to the mean of 10.51%.
Wine with a residual sugar content greater than 45 g/l is considered sweet. Only one wine in the dataset would therefore be considered sweet: it had the maximum value of 65.8 g/l. Most wines are far below this. Average residual.sugar content is 6.39 g/l, while the median is a considerably lower 5.2 g/l.
The main feature I’m interested in is alcohol. A VV wine’s alcohol content determines how specific a region can be named on its label. Usually, the more specific the location of the wine, the higher the price it commands.
I suspect as the region/varietal becomes more specific, its quality will go up.
The other key feature is residual.sugar. It plays a major part in flavour and density. Lower values could indicate secondary fermentation, which could either mean a wine that is pleasantly fizzy, or one that is more round and buttery instead of sharper and acidic. Or else it could be that it’s failed to achieve either and is simply unbalanced.
Chlorides is of interest, because several VV wine reviews I read online spoke, in positive terms, of VV wines with noticeable hints of salt. This seems strange, but in my research I’ve learned that sodium chloride in wine (as opposed to other forms of sodium) creates a soapy taste in wine. While this doesn’t sound appealing, it’s true salt is often used to contrast sweetness in chocolate and caramel.
More relevant to the data at hand, I’ve myself noticed a pleasant and very subtle salty flavour in Txakoli wine, which has been compared to VV as it’s from Spanish Basque country, a similarly lush, wet region by the Atlantic.
Lastly, fixed.acidity is of interest because one of the classic ways to describe a wine is “a nice balance of sweetness and acidity” or else a “nice balance of alcohol and acidity”. Also, these are considered green wines: youthful, sharp, clean. All features I tend to associate with more acidity.
density is of interest because it provides the body, or “mouthfeel”, of a wine, which is also something you often hear discussed in wine reviews. It is also strongly related to sugar and alcohol.
Higher values of volatile.acidity could indicate wines that have oxidized, and so help identify lower-quality wines.
Nope.
Yes: residual.sugar and alcohol, and to a lesser extent, density.
Transforming residual.sugar with log10 revealed a bimodal distribution with a mode at 2, and another mode around 10. The second mode is harder to read. The density plot actually shows two bumps on this second mode.
To see if these represented two distinct wine populations within the dataset, I created two subsets, whites.dry and whites.sweet, with a residual.sugar value of 3g being the dividing line.
When I replotted the distributions for some key variables, those for whites.dry appeared more symmetrical than in the full dataset. Those for whites.sweet, however, didn’t noticeably change shape.
I would like to compare the correlations for the full dataset and the two subsets.
Which variables have the highest correlation with quality?
First we’ll need to create samples of each of the datasets.
# whites
whites.subset <- subset(whites, select = -X)  # drop the row index to improve readability
# quality is a factor: convert via as.character() to get the scores (3-9)
# rather than the internal level codes (1-7); the correlations are the same either way
whites.subset$quality <- as.numeric(as.character(whites.subset$quality))
# create a sample so it's faster to run
set.seed(888)
whites_samp <- whites.subset[sample(nrow(whites.subset), 1000), ]
# whites.dry
whites.dry <- subset(whites.dry, select = -X)
whites.dry$quality <- as.numeric(as.character(whites.dry$quality))
set.seed(888)
whites.dry_samp <- whites.dry[sample(nrow(whites.dry), 1000), ]
# whites.sweet
whites.sweet <- subset(whites.sweet, select = -X)
whites.sweet$quality <- as.numeric(as.character(whites.sweet$quality))
set.seed(888)
whites.sweet_samp <- whites.sweet[sample(nrow(whites.sweet), 1000), ]
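One detail worth knowing when turning a factor such as quality back into numbers: as.numeric() applied directly to a factor returns the internal level codes (1 to 7 here), whereas going through as.character() recovers the original scores (3 to 9). The toy example below, with made-up values, shows the difference and why it does not affect correlations.

```r
# A small factor with the same levels as quality (values are made up)
q <- factor(c(3, 6, 6, 9, 5), levels = 3:9)

codes  <- as.numeric(q)                # internal level codes
scores <- as.numeric(as.character(q))  # original score labels

# The two differ only by a constant shift, so any correlation is unchanged
x <- c(8.8, 10.1, 10.4, 12.5, 9.5)    # made-up alcohol values
all.equal(cor(x, codes), cor(x, scores))
```

The distinction does matter for anything that depends on the actual values, such as a mean quality score.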
whites
ggpairs(whites_samp, axisLabels = 'internal',
        lower = list(continuous = wrap("smooth", alpha=0.2, shape = I('.'), color='green')),
        upper = list(combo = wrap("box", outlier.shape = I('.'))))
The 3 attributes with the highest correlations with quality are:
- alcohol: 0.451479
- density: -0.3273512
- chlorides: -0.220217
whites.dry
The strongest quality correlations for whites.dry are:
- alcohol: 0.4272068
- density: -0.401994
- volatile.acidity: -0.2307222
- residual.sugar: 0.2010104
While alcohol’s correlation is only slightly lower, density’s negative correlation has gotten stronger. The biggest change is that, for drier VV wines, volatile.acidity (negative) and residual.sugar (positive) also have relatively strong correlations with quality. In the full dataset, the third and fourth strongest correlations were chlorides and total.sulfur.dioxide (which was only -0.17).
This is especially notable for residual.sugar, since it had a very weak negative correlation with quality in the full dataset.
whites.sweet
For whites.sweet, the strongest quality correlations are:
- alcohol: 0.4833503
- density: -0.3311279
- chlorides: -0.2504071
- total.sulfur.dioxide: -0.2182216
These are the same top 4 correlating variables as in the full whites data frame, which is not surprising considering that the shapes of the distributions for alcohol, density, residual.sugar, and quality are virtually identical for whites and whites.sweet.
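Reading the top correlations off a ggpairs grid is error-prone, so a small helper that ranks them by absolute value can be handy. The sketch below is an assumption-laden illustration: `rank_cor` is a hypothetical helper, and `toy` is synthetic data generated only so the example runs on its own. With the real samples one would presumably call `rank_cor(whites_samp)` and so on.

```r
# Hypothetical helper: Pearson correlation of every other column with
# `target`, ordered from strongest to weakest in absolute value
rank_cor <- function(df, target = "quality") {
  others <- setdiff(names(df), target)
  r <- sapply(others, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Synthetic data for illustration only (not the wine data)
set.seed(888)
n <- 200
alcohol <- rnorm(n, 10.5, 1.2)
toy <- data.frame(
  alcohol = alcohol,
  density = 1 - 0.002 * alcohol + rnorm(n, 0, 0.002),
  noise   = rnorm(n),
  quality = round(6 + 0.5 * (alcohol - 10.5) + rnorm(n, 0, 0.8))
)
rank_cor(toy)
```

Running the same helper on the full sample and on each subset would give three directly comparable rankings.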
TBD.
Ideally, I would like to use whites.dry. I suspect the improved correlation for density and chlorides might be due to the extreme positive outliers having more weight in this subset. In whites.dry we lost all the positive outliers for residual.sugar and the closely-related density (and perhaps for others too).